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Abstract 

Recurrent neural networks (RNNs) are widely used 
in computational neuroscience and machine learn- 
ing applications. In an RNN, each neuron com- 
putes its output as a nonlinear function of its in- 
tegrated input. While the importance of RNNs, 
especially as models of brain processing, is undis- 
puted, it is also widely acknowledged that the com- 
putations in standard RNN models may be an over- 
simplification of what real neuronal networks com- 
pute. Here, we suggest that the RNN approach 
may be made both neurobiologically more plausi- 
ble and computationally more powerful by its fu- 
sion with Bayesian inference techniques for non- 
linear dynamical systems. In this scheme, we use 
an RNN as a generative model of dynamic input 
caused by the environment, e.g. of speech or kine- 
matics. Given this generative RNN model, we de- 
rive Bayesian update equations that can decode its 
output. Critically, these updates define a 'recogniz- 
ing RNN' (rRNN), in which neurons compute and 
exchange prediction and prediction error messages. 
The rRNN has several desirable features that a con- 
ventional RNN does not have, for example, fast de- 
coding of dynamic stimuli and robustness to ini- 
tial conditions and noise. Furthermore, it imple- 
ments a predictive coding scheme for dynamic in- 
puts. We suggest that the Bayesian inversion of 
recurrent neural networks may be useful both as a 
model of brain function and as a machine learning 
tool. We illustrate the use of the rRNN by an ap- 



plication to the online decoding (i.e. recognition) 
of human kinematics. 



1 Introduction 

Recurrent neural networks (RNN) have been used 
for many years now to augment nonlinear map- 



pings with a dynamic representation (Pearlmut 



ter||T9 89; Wilhams and Zipser[ [l989 Na rendra and 
Parthasarathy^ |1990[ Jaeger [2001; ,Maass et al. 
2002), for example, for the classification of sen- 



sory input in machine learning. In computational 
neuroscience, RNNs are extensively used to inves- 
tigate the dynamic properties of cortical networks 



(e-g- 


Buonomano and Maass 


2009 


Legenstein and 


Maass 


2007 


), to model the measured activity of 



(e-g. 


Fristonetal. 2003 


Kiebel 


|Sotero et al.l 


2007 


Rodrigues 



et al.[ 2010) and more generally to model brain 



processes like perception, memory and attention 



(e-g- 


Elman 


1990 Miller and Cohen 


2001 


Hamker 


2005 


). The recurrent connections of these net- 



works capture two of the most prominent features 
of neuronal networks observed in the brain: Firstly, 
connections between two neurons are rarely uni- 
directional but more often bi-directional, poten- 
tially via more than one synapse. Secondly, neu- 
rons perform highly nonlinear operations, i.e. they 
transform their input to spiking output. Recurrent 
neural networks capture both these features where 
often the input (post-synaptic potentials) and out- 
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put (action potentials) are replaced by summary 
measures, i.e. the post-membrane potential func- 
tion and firing rate. In such a continuous-time 
RNN, each neuron (often called unit) performs a 
simple operation: In each moment in time, it ap- 
plies a nonlinear function to the sum of its input 
and passes this on to other neurons. This sim- 
ple mechanism can provide for extremely rich pat- 
terns of activity in each neuron, even with a net- 
work of small size. Literally thousands of contribu- 
tions in computational neuroscience and machine 
learning arc based on networks of these firing rate- 



Samuelides 2007) 



coding units (Rabinovich et al. 2006 Cessac and 



As powerful as RNNs are as a model class, re- 
cent contributions have questioned the usefulness 
of RNNs to explain neurobiological phenomena ob- 



served in real networks of neurons ( Spruston 2008 



Mel 


2008 


Debanne et al. 


2011 



20111. Here, we suggest 



that a simple re- interpretation of the functional role 
of RNN dynamics leads to a novel and potentially 
more plausible account of what RNN units may 
compute: We suggest that neuronal networks serve 
as Bayesian decoders of dynamics caused by the en- 
vironment. For example, in action observation, hu- 
mans decode the kinematics of other people from 
visual input dynamics. For Bayesian recognition, 
one needs a so-called generative model for the hid- 
den dynamics of the environment which cause sen- 
sory input to the brain. We suggest that RNNs are 
an ideal generative model for these hidden dynam- 
ics in our environment. The task of the recognition 
system is to decode the sensory input generated 
by the hidden RNN dynamics. To do this, we de- 
rive Bayesian update equations from the generative 
RNN model and call these 'recognizing RNN'. The 
difference to a standard RNN is that each unit in 
the recognizing RNN computes more sophisticated 
updates involving predictions and prediction error 
messages from other units in the network. Here we 
show that a recognizing RNN can decode real-world 
dynamics (human kinematics) and display several 
features which can also be observed with real neu- 
ronal systems, e.g., the online decoding of hidden 
dynamics in the environment, computation of pre- 
dictions and prediction error, robustness to noisy 
input and fast adaptation to sudden changes in the 
environment. These features are not only general 
hallmarks of brain function but may, in principle, 
also be useful for machine learning applications for 



decoding dynamics in an online fashion. 

In computational neuroscience, models of recur- 
rently connected networks of neurons, which opti- 
mally estimate dynamically changing states from 
noisy observations, have recently been proposed 
( Rao[ |2004| peneve et al.[ |2007| Natarajan et al 



2008! [Wilson and Finkel[|2009||BoerUn and Deneve 



12011). While these models provide important in- 
sights, results were reported for relatively restric- 



tive conditions such as linear dynamics (Deneve 



et al. 


2007 


Wilson and Finkel 


2009 


Boerlin and 


Deneve 


20111, discrete states ( 


Elao 


2004 


Deneve 


et al. 


2007 


Boerlin and Deneve 


2011 


, or a one-di- 



mensional state (Natarajan et al. 2008 Wilson and 



Finkel 2009 Boerlin and Deneve 2011 1. Although 



Natarajan et al. (2008) allow for nonlinear dynam- 
ics they assume knowledge of an ideal observer 
which provides an instantaneous error signal for 
learning of network connections. Similarly, reser- 



voir computing approaches (Jaeger 2001 Maass 



et al. 2002 Verstraeten et al. 



2007 1 rely on a teach- 



ing signal which provides a desired output at ev- 
ery point in time during learning. In contrast, we 
propose an approach combining multi-dimensional, 
continuous-time hidden nonlinear dynamics where 
learning proceeds without an externally provided 
error signal. Our main contribution is to demon- 
strate that a recognizing RNN is well suited to rec- 
ognize dynamic stimuli and may be used as a func- 
tional model for neuronal ensemble dynamics. In 
particular, we will illustrate this by showing that 
the prediction errors computed by a recognizing 
RNN provide sufficient information to discriminate 
dynamic stimuli, in an online fashion. 

The present approach may also lead to a better 
understanding of the role of recurrently connected 
networks of neurons in the brain: predictive cod- 
ing has been suggested as a theory for hierarchi- 
cal processing in the brain in which different levels 
exchange prediction and prediction error messages 



( Mumford 


1996 


Rao and Ballard 


1997 


1999 Fris- 


ton and Kiebel 


2009 




Rao and Ballarc 


( 


1997 


) al- 



ready described RNN-like dynamic models to im- 
plement predictive coding for static stimuli. The 
present approach can be seen as an extension to 
Rao and Ballard's original work to provide infer- 
ence for dynamic stimuli by resorting to approxi- 
mate inference methods for nonlinear, continuous 



dynamic models ( Friston et al. 2008 ) 



The remainder of the paper is organised as fol- 



2 



lows. In Materials and Methods we (i) present 
RNNs as generative models, (ii) describe the 
Bayesian inference framework and (iii) show that 
dynamic updates of the posterior state critically de- 
pend on prediction error. We illustrate the rRNN 
approach using human motion capture data. In Re- 
sults we demonstrate that the rRNN can success- 
fully recognize human kinematics and discriminate 
between different walking styles based on the pre- 
diction error of rRNN units. 



2 Materials and Methods 

In the following, we will describe the two key ele- 
ments of the present approach: a recurrent neural 
network (RNN) as a generative model of the sen- 
sory dynamics and the Bayesian inference frame- 
work to derive the update equations for a recogniz- 
ing RNN (rRNN). Subsequently, we will apply the 
rRNN technique to the recognition of human kine- 
matics, for which we describe the kinematic data 
and the rRNN settings. 

To motivate the present approach, we will start 
with a brief summary of the conventional RNN 
technique as used in machine learning for classifica- 
tion of stimuli. Note, however, that it is not our aim 
to compare discrimination performance of conven- 
tional and recognizing RNNs. Rather, description 
of the conventional RNN is given as a reference for 
understanding the conceptual differences between 
the two approaches. 

2.1 Conventional Recurrent Neural 
Network 

The RNN technique has been used in many ma- 
chine learning applications such as classification of 
static or dynamic stimuli, or time-series predic- 
tion. This approach has a long history which took 
off with the development of a supervised learning 



An example of such a network is discussed in 



routine (Pearlmutter 1989 Williams and Zipser 



1989). Recently, this learning approach has been 



complemented by the so-called reservoir computing 



technique (Jaeger 2001 Maass et al. 20021 



In general, in a conventional RNN, sensory units 
provide input, which drives the dynamics of the 
hidden units (see Fig. [l]^). Output units readout 
the result of the dynamic computations based on a 
mixing of the sensory and hidden states. 



( Jaeger et al. 2007 1 where the continuous-time dy- 



namics based on leaky-integrator units is given by 
ii = /(x)i 

= h {-axi + tanh([W'"y + Wx + W^^o],)) 

(1) 



where Xi is the state of hidden unit i G {1, . . . , H}, 
units, o G 



are the states of the input (sensory) 



pDxl 



are the states of the readout units, 
W G M.^^^ is a weight matrix defining the inter- 
action between the H hidden units, similarly W" 
and W*^ define the connections from the input to 
the hidden units and the (optional) feedback con- 
nections from the readout units, respectively, ki is 
a rate constant for unit i and a is the amount of 
leakage. The output states are determined by 



o-V[x^^',y^l 



(2) 



where V G is a weight matrix. 

In a conventional RNN, the overall flow of in- 
formation is from sensory to output units, because 
the RNN serves as a model for neuronal dynam- 
ics (hidden states) which are used to compute, e.g., 
a classification of the sensory input. We now use 
the same dynamics where we reverse the flow of 
information to model the generation of sensory dy- 
namics by hidden states of the environment (e.g. 
body movements cause visual output dynamics). 



2.2 



Generative 
Network 



Recurrent Neural 



Our overall aim is to construct a recognition system 
which can recognize its sensory observations based 
on its internal dynamics. For a Bayesian recogni- 
tion system we require a dynamic generative model, 
for which we choose a RNN. This 'generative recur- 
rent neural network' (gRNN) runs independently of 
any input and generates sensory data, i.e. observa- 
tions. Note that, in comparison to a conventional 
RNN (Eq. [T]), here the sensory units become the 
output of the network while no input units are de- 
flned (hence the missing units which acted as out- 
put in the conventional RNN). Consequently, the 
flow of information is reversed in the gRNN and 
its autonomously running hidden dynamics drive 
its sensory units (see Fig. [lj3). In particular, we 
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ABC 

output 
unit{s) 




O "'^ o ^ 

Figure 1: Comparison of different RNN architectures. (A) conventional RNN, (B) generative RNN 
(gRNN) and (C) recognizing RNN (rRNN). Each RNN has dynamic hidden units, but the overall direc- 
tion of information flow differs (indicated by the grey triangles). The conventional RNN is designed to 
compute an output given sensory input. In contrast, the gRNN computes sensory states. Finally, the 
rRNN computes predictions (black arrows) and prediction error messages (red arrows) to recognize the 
hidden causes that generated the sensory input. 



define a gRNN as 

±i = /(x)i = ki {-axi + tanh([Wx]i)) (3) 

y = Vx (4) 

where now V e M^^^ linearly translates hidden 
states X into sensory states y. This gRNN com- 
putes sensory output y as caused by a hidden, dy- 
namic process defined by the RNN dynamics /(x). 
In the following section, we describe how a recogniz- 
ing recurrent neural network (rRNN) is constructed 
from the gRNN using Bayesian inversion. This 
rRNN receives sensory observations (as in a con- 
ventional RNN, Eq. [T]) and infers about the hidden 
states that caused these observations. Effectively, 
conventional RNN computations are aimed at do- 
ing the same (c.f. Fig. [T|\_,C); however, the update 
equations of an rRNN are explicitly derived for this 
recognition task. 



state transitions are noisy, or uncertain, Bayesian 
inversion of the generative model leads to updates 
of the hidden states which make them an opti- 
mal representation of the observations given the 
model. For example, the well-known Kalman- 



Bucy filter (Jazwinski 1970) implements such a 



Bayesian inversion scheme for linear dynamic pro- 
cesses. The gRNN uses highly nonlinear dynam- 
ics (Eq. [3| and, therefore, we require approxi- 



van der Merwe 2001 Doucet et al. 2001 Friston 



mate inversion schemes (Jazwinski 1970 Wan and 



et al. 2008 Daunizeau et al.) 2009 Friston et al 



2010 1 . Here, we derived the update equations using 



the D-step of Friston's dynamic expectation maxi- 



mization (DEM) framework (Friston et al. 2008). 
This choice was based on our previous experience 
with inversion of continuous-time dynamic models 
using DEM ( [Kiebel et aT] |2009b[ ) but, in princi- 
ple 



2.3 Recognizing 
Network 



Recurrent Neural 



Generative models like a gRNN (eqs 
used to derive a Bayesian inference system that 
recognizes input caused by the generative model 



In the probabilistic setting, when observations and 



other inversion schemes could be used as well. 
DEM uses generalised coordinates, local linearisa- 
tion and point-estimates at strategically important 
positions. See the appendix for a high-level deriva- 
tion of the algorithm and an explanation of gen- 
3p ) can be eralised coordinates which are a dynamically ex- 
tended representation of state variables, the use of 
which we indicate by a tilde in the subsequent for- 



4 



mulas. 

In the following, we will briefly describe the key 
computations performed by DEM. This description 
is aimed at giving an intuitive description of the up- 
date equations governing the rRNN and will allow 
interpretation of these updates in terms of predic- 
tion and prediction error messages. 

The most important equation resulting from in- 
version with DEM describes the evolution of the 
posterior mode of the hidden states in generalised 
coordinates and is given by 



D5 



(5) 



The motion defined in this equation consists of two 
parts: 1) Dx which, in absence of other contribu- 
tions, implements that the motion of the posterior 
mode follows its local trajectory represented in gen- 
eralised coordinates using a derivative operator D 
and 2) the derivative of the variational energy V^(x) 
with respect to hidden states which acts as a correc- 
tive force to make the motion consistent with the 
gRNN and the observations. With fixed parame- 
ters, the variational energy is the log-joint proba- 
bility of observations (sensory states) y and hidden 
states X which defines the probabilistic gRNN. In 
particular, the variational energy is given by 



F(x) = logp(y,i|i 

= iogp(y|x, 

= logp(x|y, 



-logp(Dx|f,0) (6) 
- c 



where c is a constant, is a vector consisting of 
all parameters of the model and f are the dynamic 
predictions defined by Eq. [3] in generalised coordi- 
nates. The last term illustrates that the updates 
are a dynamic form of maximum a posteriori es- 
timation of hidden states. Gaussian distributions 
are assumed for the state transition and observa- 
tion densities: 



p(y|i,0)^AA((I®V)5c,S, 



(7) 
(8) 



where (1(g) V)x is the predicted sensory state given 
the hidden states as defined by Eq. [4] in generalised 
coordinate^ and f^x and Sy are the prior covari- 
ances of sensory and hidden states in generalised 

^ I is the identity matrix with size equal to the number of 
used generalised coordinates and (gi is the Kronecker product 



coordinates, respectively. This leads to a simple in- 
terpretation of the posterior mode updates in terms 
of prediction errors. In particular, the gradients of 
these densities with respect to hidden states be- 



aiogp(Dx|f, 
9x 



aiogp(y|x, 
dx 



ld_ 

dix 
di 



(9) 



2di y y y 

T 



dSy 

(9x 



(10) 



where the prediction errors are defined as 



Er = Dx — f 



ey^y-{I(g) V)x. 



(11) 



This means that the updates of the posterior hid- 
den states follow the gradient of the prediction error 
with step sizes determined by the prediction error 
itself weighted by the prior precisions. The con- 
tribution from the prediction error on the sensory 
states, iy, ensures that the sensory states are well 
explained by the hidden states while the contribu- 
tion from the prediction error on the hidden states, 
ix, ensures that the posterior dynamics of hidden 
states as encoded by the generalised coordinates 
is consistent with the learnt model dynamics. In 
particular, for the first generalised coordinate, the 
prediction error 



(12) 



ensures that the posterior velocity corresponds to 
the learnt, noise free hidden unit dynamics as de- 
fined in Eq. [3j Conversely, we will argue below that 
a consistently large prediction error Sx provides ev- 
idence for an inconsistency between observed and 
learnt dynamics and can be used to discriminate 
among different dynamic stimuli. 

The question remains how the system got to 
know a suitable gRNN which generates specific sen- 
sory dynamics. In our experiments we let the sys- 
tem learn its generative model by adapting connec- 
tivity parameters W, V and rate constants k using 
an approach which was developed for the identifi- 



cation of dynamical (neural-mass) systems ( Friston 
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et al. 2003 Kiebel et al. 2009a I and is based on 



maximum a-posteriori estimation of the parame- 
ters (iFristonl 120021 iFriston et all 120021). See the 



appendix for details. Note that this initial learn- 
ing step is not our main point in this paper; any 
learning approach that successfully learns hidden 
gRNN dynamics to represent a given dynamic stim- 
ulus could be used here (e.g. 
2001[ Roweis and Ghahramani [ 
Karhunen', 2002 



Wan and Nelson[ 
20011 IValpola andj 



Doucct and Tadic, 2003; Archam- 
beau et al. |2008f [Friston e t al. , 2008j |Daunizeau| 
Kantas et all [2009 ; Laz ar et al.| [20091 



et al.l 



2009 



Schon et al. 



2011), but also standard RNN learn- 



ing may be used, if the hidden state dynamics is 
assumed to be deterministic during learning. 

2.3.1 Message passing in the rRNN 

The updates defined by equations [5| [9| [lO| and [Tl] 

can be interpreted as network dynamics based on 
messages sent by sensory and hidden units. Alge- 
braically, this can be seen by exemplarily inspect- 
ing the observation density update equation, Eq. 
[TOl for the first generalised coordinate of a single 
hidden unit i: 



aiogp(y|x, 

dxi 



dx,y ^y 



E 



dx, 



^^y 



(13) 



where the sum over j runs over sensory units yj 
in generalised coordinates. Note that the partial 
derivative of the prediction error of sensory unit j 
with respect to the state of hidden unit i describes 
how a change in the state of unit i effects the pre- 
diction error of unit j. Therefore, the state update 
for hidden unit i is a weighted sum of these predic- 
tion error gradients where each element of this sum 
corresponds to a 'prediction error message' from a 
single sensory unit j. To compute the prediction 
error message a sensory unit first has to compute 
a prediction. This is done using the forward equa- 
tion (|4| of the gRNN which is a weighted sum of the 
hidden states x where the weights are determined 
by the connectivity of the gRNN. In the following 
we call the elements of this sum 'prediction mes- 
sages' which are sent from a hidden unit x^ to a 
sensory unit j/j . In summary, the update equations 
define a recognizing RNN (rRNN) where a hidden 



unit sends prediction messages to connected sen- 
sory and hidden units such that these can compute 
prediction error messages which are returned to the 
hidden unit to update its state (see also Fig. [ip). 
The updates resulting from the dynamics density, 
Eq. [9] follow the same logic, where the hidden unit 
Xj takes the place of sensory unit yj . Each hidden 
unit, therefore, sends and receives two kinds of mes- 
sages: prediction and prediction error messages. 

2.3.2 Induced connectivity of the rRNN 

The connectivity matrices W and V of the gRNN 
(Eq. [3| [4|) are not necessarily the same as in the 
rRNN. Generally, the rRNN will have all connec- 
tions of the gRNN plus the corresponding recip- 
rocal connections, plus some additional ones. To 
see this, note that the prediction error messages in 
the rRNN in Eq. 13 are 0, when hidden unit i is 
not connected to sensory unit j, i.e., when hidden 
unit i has no direct influence on the computation 
of predictions in sensory unit j (then ^^^^^ = 0). 
Only sensory units which receive a connection from 
a hidden unit i in the gRNN will contribute mes- 
sages containing the derivative of the prediction er- 
ror. However, in the rRNN, sensory units j, which 
are not connected in the gRNN to a hidden unit i, 
may also contribute messages, containing only their 
prediction error, through the weights computation 
Wj = [^y^ey]j. In particular, if the j-th row of 
the sensory precision matrix, has nonzero en- 
tries in positions other than j, e.g., fc, the weight of 



sensory unit j in the update equation (Eq. 13 ) de- 
pends on the prediction error of unit k. In this case, 
sensory unit k contributes to the update of hidden 
unit i, even though hidden unit i is not connected 
to sensory unit k in the gRNN. This means, that 
there is an additional connection from sensory unit 
k to hidden unit i in the rRNN. 

In conclusion, only if the covariance matrix 
is diagonal, the connectivity matrix of sensory to 
hidden units in the rRNN will only contain those 
connections which are reciprocal to the hidden to 
sensory unit connections in the gRNN. Conversely, 
if there are off-diagonal entries in Ey, there will be 
corresponding additional connections from sensory 
to hidden units in the rRNN, relative to the gRNN. 
The same considerations apply to the connectivity 
between hidden units. In summary, the connectiv- 
ity of the rRNN directly follows from the gRNN, 
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only if the units' states in the gRNN are a priori 
independent. For simphcity, this case is shown in 
Fig. [Tp and used in the following simulations. Note 
that a diagonal covariance matrix T,y is a natural 
assumption for the present data because we assume 
that the measurement noise is white and any cor- 
relation among observations is caused by the un- 
derlying dynamics which are modelled by the RNN 
dynamics. 

2.4 Human Movement Data 

We use human movements to demonstrate the 
properties of the rRNN in the experiments be- 
low. The kinematics of humans is highly dy- 
namic and nonlinear through complicated inter- 
actions between individual joints. Kinematics, 
therefore, provides a good example of the kind of 
complex, dynamically changing, real-world stim- 
uli which can be modelled using rRNNs. Here, 
we used three walks of the same subject, each 
of which expresses a different walking style (cat- 
egorized as 'childish', 'depressed' and 'shy'; freely 
available from the CMU motion capture database. 



|http://mocap. cs.cmu.edu subject 142, motions 1, 
5 and 19). We chose this particular subject because 
a large range of different movements were available 
among which we chose the selected walks because 
of their similar time-scales. The advantage of us- 
ing motion capture data as compared to video is 
that we can focus on modelling the kinematics of 
the subject in terms of changing joint angles with- 
out the need to model detailed processing of visual 
information. 

For each walk we removed the global translation 
of the body and computed the 3D positions of the 
joints and extremities for all time points. This re- 
moved potential 'jumps' introduced by the circu- 
larity of the joint angles. As a result we obtained a 
set of 30 points moving in 3D space (see Fig. [2] for 
an example). Subsequently, we selected four rep- 
resentative seconds of data starting when the left 
foot touched the ground for each walk and sub- 
sampled the data using 30 frames per second re- 
sulting in = 120 data points per walk. These 
data covered roughly two footsteps for each move- 
ment. We then found a common, low-dimensional 
representation for the three walks using principal 
component analysis (of all three walks combined) 
which reduced the dimensionality from 90 dimen- 



sions to _D = 5 (maintaining 95.5% of the original 
variance). Additionally, we scaled the coordinates 
of each walk such that the maximum absolute value 
in each dimension was 1 over all walks. In sum- 
mary, we obtained for each of the three walks a 
sequential data set containing five trajectories each 
consisting of 120 time points, see Fig. [3] 

2.4.1 Learning of generative RNNs 

For each of the three walks, we constructed one 
gRNN by learning suitable parameters W, V and 
k (eqs. [3]and|4]) so that the dynamics of each gen- 
erative RNN replicated the movement data. Each 
RNN had five sensory units, each of which gener- 
ated one of the scaled principal component coor- 
dinates. In initial tests, we found that a network 
with H — 12 hidden units was the smallest network 
which gave consistently acceptable learning results 
and we consequently used this network size in our 
experiments. These tests also showed that good 
learning results were obtained, if the hidden units 
were sparsely connected. In particular, we fixed 
2/3 of all connections in W and 1/3 of all con- 
nections in V to 0. Other entries in W, the rate 
constants k and the initial vector of hidden unit 
states x(0)' were chosen randomly before learning 
while V was initially chosen to correctly predict the 
first data point of a walk given x(0)'. For details 
of this initialisation and the learning procedure see 
the appendix. Note that any learning procedure 
could have been used here. The main point made 
by this initial learning step is that a dynamic rep- 
resentation for each walk can be found using RNNs 
with few units. 

The sensory state trajectories of the learnt 
gRNNs are shown in Fig. [3| Each of the three dif- 
ferent walks was learnt well: The amount of vari- 
ance explained for each walk was 99%, 97% and 
97% for the childish, depressed and shy walks, re- 
spectively. 



3 Results 

Here we demonstrate the utility of Bayesian infer- 
ence for RNNs for online recognition of dynamic 
stimuli. As a proof of principle we apply the ap- 
proach to the multi-dimensional, nonlinear kine- 
matics of a walking human. We will first show 
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childish depressed shy 




Figure 2: Example frames from the three different walking styles: childish, depressed and shy (left to 
right). In our experiments we used the first five principal component coordinates of the motion capture 
3D joint coordinates (indicated as filled circles) as observation variables. Lines are plotted only for 
visualisation purposes. 




Figure 3: Dynamics of three different human walks and model fits in principle component space (five 
components). Dotted lines: Original dynamics, Solid lines: Trajectories generated by a gRNN after 
learning. While the fit between data and its gRNN replications was not perfect, it was sufficiently close 
such that the gRNN was an appropriate generative model for recognition (see Fig. Isl). 
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that recognizing RNNs quickly and successfully rec- 
ognize the hidden dynamics, i.e. decode the type 
of movement. Then, we will demonstrate that the 
prediction errors of the hidden units can be used to 
discriminate the three different walks. Finally, we 
will show that the rRNNs are robust against noisy 
observations and initial conditions. Note that all of 
the following experiments with the rRNN use the 
original motion capture data as observations. 



rately predicted all subsequent observations while 
hidden unit trajectories followed those typical for 
the learnt gRNN to a large extent. (Note that these 
results partially depend on an appropriate choice of 
the prior covariances, see Appendix C.) This means 
that the rRNN can represent the dynamic reper- 
toire of the gRNN but, in addition, can rapidly 
switch to the specific dynamic regime that best ex- 
plains the sensory input. 



3.1 Fast Recognition of Dynamics 

In this section, we show that the rRNN can quickly 
start recognizing a movement online. In particular, 
we show that this 'quick response' is robust against 
the initial hidden states at the beginning of the 
recognition process. This robustness is obtained 
despite the fact that gRNNs have a large depen- 
dence on their internal initial conditions. This is 
because RNNs are in general rich dynamic models 
which are capable of simultaneously representing 
many different dynamic stimuli depending on their 
initial conditions (hidden states). We demonstrate 
this for the gRNN for the childish walk. This gRNN 
was initialised during learning with the state x(0)'. 
When this specific gRNN is started, after learning, 
in this state, the learnt shy walk is generated as 
shown in Fig. |3](left). However, when we initialise 
the same gRNN with a state x(O)'' = x(0)' + e, 
which was perturbed by noise of the same size as 
the natural variability of the hidden states, it gen- 
erated very different trajectories of sensory states 
y as well as hidden states x as shown in Fig. |4] In 
other words, for deviating starting conditions, the 
gRNN generates dynamics that look different from 
the learnt kinematics and, when plotted in motion 
capture space, can deviate severely from a natural 
walk. 

In contrast, the rRNN based on this gRNN for 
the shy walk was robust against such differences 
in initial conditions. Even though we perturbed 
the rRNN initial states severely, the rRNN al- 
ways switched rapidly to the appropriate dynam- 
ics which best described the sensory input of a shy 
walk. In other words, the prediction error updates 
of the hidden units forced the dynamics on a tra- 
jectory which predicted the observed walk. We de- 
pict a characteristic example of this quick response 
behaviour for the rRNN (childish walk) in Fig. [5] 
(A,B). After only one time step the rRNN accu- 



3.2 Discrimination of Dynamic 
Stimuli 

After learning, we have three different rRNNs, each 
of which has learnt to predict one of the three walk- 
ing styles childish, depressed and shy. Here, we 
will show that the prediction error, on observations 
Ey — y — Vx or hidden states Ex = x — /(x) (Eq. 
12 1, of all three rRNNs can be used to discriminate 
between the three different walks. In particular, 
we will show that the dynamic prediction errors 
are smallest for the rRNN that has learnt a specific 
walking style. This means that a potential readout 
mechanism can use the relative amplitudes of pre- 
diction error of the three rRNNs to decide which of 
the three walks is currently observed. 

Fig. [5] (D-I) shows the response of the rRNN 
which learnt the childish walk, but now given the 
depressed and shy walks as observations. Although 
this rRNN did not learn these walks it represented 
them well by exploiting alternative dynamics em- 
bedded in the twelve unit network. However, the 
rRNN frequently had to use prediction error on its 
hidden states to explain away the remaining mis- 
match between internal predictions and actual in- 
put. Sec Fig. [5] (C,F,I) for this relative increase in 
prediction error in response to the non-learnt de- 
pressed and shy walks. This increase in prediction 
error when recognizing the two non-learnt walks is 
consistent over the three different rRNNs and may 
be used to discriminate dynamic stimuli as shown 
in Fig. |6] (D-F). For each of the three walks the 
prediction error was smallest for the rRNN which 
actually had learnt this specific walk, see also Ta- 
ble [l] The prediction errors on observations showed 
this effect as well, although not as clearly (Fig. [6] 
A-C). 

We also investigated the effect of learning on the 
accumulated prediction errors by comparing the 
prediction errors of the learnt rRNNs with those of 
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Figure 4: Influence of initial hidden states on dynamics of generative RNN. Shown are trajectories 
of sensory (A) and hidden (B) states of the generative RNN for two different initial hidden states. 
Dotted trajectories resulted from initial hidden states used during learning (x(0)') while solid trajectories 
resulted from random initial hidden states (x(O)''). In this example we used the RNN parameters learnt 
for walk 1 (childish), but results are qualitatively similar for other RNN parameters. 



Table 1: Absolute prediction errors summed over time points and sensory, or hidden states, respectively. 
Top: sensory state prediction errors. Bottom: dynamic hidden state prediction errors. Each column 
presents results for each of the three different rRNNs on one of the three data sets childish, depressed and 
shy. rRNN (random) : average accumulated prediction error obtained from 30 random rRNNs (values in 
parentheses show the minima) . rRNN with lowest prediction error on each data set is indicated by bold 
font. Note that we excluded the first four out of 120 time points from these sums, because the initial 
transient period otherwise may distort the results. 



sensory state prediction errors 





childish 


depressed 


shy 


rRNN (childish) 


1.44 


5.58 


6.81 


rRNN (depressed) 


1.69 


0.43 


1.53 


rRNN (shy) 


2.89 


2.66 


0.56 


rRNN (random) 


4.37(2.37) 


4.30(2.20) 


4.21(2.57) 


hidden 


state prediction errors 






childish 


depressed 


shy 


rRNN (childish) 


0.85 


5.15 


7.38 


rRNN (depressed) 


5.95 


1.06 


4.05 


rRNN (shy) 


7.39 


6.44 


1.48 


rRNN (random) 


4.96(3.04) 


4.86(3.13) 


4.65(3.03) 
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Figure 5: Result of the three different walks recognized by one of the rRNNs (childish walk). (A,D,G) 
show the presented data (dotted lines) and the predicted sensory states (solid lines). (B,E,H) show the 
posterior hidden states (solid lines) and, for comparison, the hidden states of the corresponding gRNN 
when run autonomously from the initial states used during learning (dotted lines, cf. dotted lines in Fig. 
|4|3). (C,F,I) show the dynamic prediction errors of the hidden states (Eq. 12 note that these prediction 
errors do not correspond to the difference between solid and dotted lines in the middle panels). The 
different rows of panels correspond to the different walks which were recognized (from top to bottom: 
childish, depressed and shy). Prediction errors were markedly lower, when the rRNN recognized the 
walk it was adapted for (C vs. F,I). 
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Figure 6: Comparison of absolute prediction errors. Each panel shows summed (over state dimensions) 



absolute prediction errors of sensory (A-C) and hidden (D-F, Eq. 12) states of the three different rRNNs 
when data from one of the three different walks were observed. Each rRNN corresponds to one colour 
(childish: black, depressed: red, shy: yellow). Prediction errors of the rRNN, which has been learnt for 
the observed type of walk, are indicated by thick lines. 
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random rRNNs. We generated 30 random rRNNs 
by drawing random parameters W, V and k while 
using the same connectivity constraints as for the 
gRNNs which were used for learning the walks. 
The accumulated prediction errors for the random 
rRNNs, thus, give an estimate for the total amount 
of prediction error expected in a random rRNN, 
i.e. without learning. As expected, the predic- 
tion errors of random rRNNs were always higher 
than those of the rRNNs with learnt parameters 
(see Table [T]). Furthermore, for non-learnt stim- 
uli, the learnt rRNNs often produced larger pre- 
diction errors than random rRNNs. This indicates 
that learning a specific walk restricts the dynamic 
repertoire of an rRNN. We conclude that the learn- 
ing procedure resulted in rRNNs which were suited 
to discriminate the walks. 

In an additional experiment we concatenated 
data from all three walks into a single sequence 
to simulate online recognition of three walks, see 
Fig. [7] The resulting saccade-like, abrupt tran- 
sitions between walking styles led to a transient 
increase in prediction errors correctly signalling a 
discrepancy between predictions and actually ob- 
served kinematics. Furthermore, we implemented a 
simple readout mechanism for dynamic prediction 
errors using a filter which sums the absolute predic- 
tion errors over the last 30 time points and weights 
recent time points more strongly. This operation 
smoothes prediction errors temporally and stresses 
differences that stretch over a similar period as the 
filter size, see Fig. [7j After each transition, the 
rRNNs reduced their smoothed prediction errors 
quickly until the rRNN with parameters learnt for 
the currently observed walk was the only one for 
which the magnitude of prediction errors fell below 
an ad-hoc threshold. This shows that the present 
approach can be successfully used to recognize a 
specific walk by choosing the model with the low- 
est prediction error, after some initial transient has 
died away. 

3.3 Robustness against Noise and 
Initial Conditions 

Here, we demonstrate that the recognition scheme 
is robust to both noise and variations in initial con- 
ditions. We repeated the experiments above for 
increasing amounts of white observation noise and 
twelve different, randomly chosen sets of initial con- 



ditions, see Fig. [8j We found that the overall mag- 
nitude of dynamic prediction errors is proportional 
to the amount of observation noise. This indicates 
that observation noise is explained away by predic- 
tion errors of both sensory and hidden units. Im- 
portantly, the discrimination ability of the three 
rRNNs is maintained up to moderate amounts of 
noise, i.e., prediction errors still contained sufficient 
information to discriminate the three walks. As ex- 
pected, for large amounts of noise, the contribution 
from observation noise eventually masked the pre- 
diction error contributed by the difference in walks. 
Also note that the dynamic prediction errors of the 
learnt rRNNs on their learnt walks (bottom trajec- 
tories in the panels of Fig. |8]) had very low variabil- 
ity across initial conditions. This means that the 
rRNN, which was learnt for a specific walk and ob- 
serves this walk as input, was much less dependent 
on its initial conditions than the rRNNs learnt on 
different walks. Yet, the variability of prediction 
error due to initial conditions within each rRNN 
was not large enough to influence the result of dis- 
crimination of the walks up to moderate amounts of 
noise. In other words, in our experiments accumu- 
lated prediction errors of the rRNN learnt for the 
current walk were always smaller than those of the 
other rRNNs (up to moderate amounts of noise), 
even when beneficial initial conditions for them led 
to better than average accumulated prediction er- 
rors. 



4 Discussion 

In this paper we have described the 'recognizing 
recurrent neural network' (rRNN), which is a re- 
current neural network where each unit computes 
both predictions and prediction errors to recognize 
sensory input in a Bayes-optimal fashion. We de- 
rived the update equations of both sensory and 
hidden units by using an approximate Bayesian 
inference framework for nonlinear dynamical sys- 
tems, i.e., dynamic expectation maximization (Fris- 



ton et al. 2008 ) . The rRNN approach is motivated 
by two considerations: Firstly, in an rRNN, the 
additional prediction error messages, as compared 
to a conventional RNN, increase the complexity of 
the messages passed between units in the network. 
We propose that these computations may be bet- 
ter descriptions than conventional RNNs of what 
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Figure 7: Filtered Dynamic Prediction Errors. We concatenated sensory input from all walks into a 
single sequence and inferred the hidden states for all three rRNNs. Shown are temporally smoothed, 
summed absolute prediction errors for the rRNNs learnt on the childish (black), depressed (red) and shy 
(yellow) walks. We also plotted an ad-hoc decision boundary (dashed, grey line) which could be used to 
select models with a low prediction error. 
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Figure 8: Dependence of dynamic prediction errors on noise and initial conditions. Each panel shows 
average sums of absolute prediction errors for the three different rRNNs on one of the walks. The averages 
are over 12 randomly chosen initial states x(O)'' and shading indicates the region around the mean of 
twice the standard deviation. The x-axis indicates the standard deviation of independent Gaussian noise 
added to the principal components of the walks on a log-scale. Note that the exponential increase of 
prediction errors with noise in the log-plot means that prediction errors depend approximately linear on 
the observation noise magnitude. 
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real neuronal ensembles implement. We base this 
hypothesis on recent findings and theoretical con- 
siderations which show that single neurons (and 
consequently neuronal ensembles) compute much 
more complex functions than previously thought 



( Sidiropoulou et al.| 


2006 


Spruston 




2008 




Mel 


2008 


Pissadaki et al. 


2010 


Debanne et al. 


2011) 



The general idea is that a single neuron may in 
principle compute complex, nonlinear and dynamic 
functions using its spatiotemporal voltage depolar- 
isations and other dynamics like calcium fluctua- 
tions (Mel 2008). Although it is yet unclear how 



the computation of predictions and prediction er- 
rors may map to cellular dynamics, intracellular 
dynamics may have in principle the computational 
complexity to perform Bayesian decoding of their 



synaptic input ( Deneve 2008 1 . Secondly, the com 



putation of predictions and prediction errors in an 
rRNN fits well with the recent reappraisal in cogni- 
tive neuroscience that the brain may use a predic- 
tive coding approach to perception and cognition 



( van Wassenhove et al. 


2005 Summerfield et al. 


2006 


Bar 


2009 


Friston and Kiebel 


2009 


). The 



rRNN approach is a mathematical description of 
how this predictive coding scheme could be im- 
plemented for complex, multi-dimensional dynamic 
stimuli. 

To illustrate that rRNNs may be an interest- 
ing model for understanding the brain function of 
recognition and prediction for naturalistic stimuli, 
we showed that rRNNs can robustly recognize kine- 
matics as observed with motion capture data. We 
found that the prediction error computed by an 
rRNN can be used to recognize and discriminate 
between different human walking movements in an 
online fashion. Furthermore, this recognition mech- 
anism is robust against both noise on the observa- 
tions and variations in the initial state of the rRNN. 
In other words, rRNNs may be used as functional 
models for human action observation studies, e.g. 
(iBlake and Shiffrarl 120071). The rRNN approach 



unifies many important aspects of brain process- 
ing such as statistically optimal inference in highly 
variable and noisy environments, recurrent connec- 
tions, online recognition of dynamics, and quick 
adaptation to sudden changes in the environment. 
Therefore, the rRNN approach may be a useful de- 
vice to bridge the gap between behaviour-driven 
models of cognition and neurobiologically plausible 
models of neuronal ensembles. 



The idea to use autonomous RNNs as genera- 
tive models is not entirely new. In previous work, 
we have used this approach in system identification 
where we explained neuroimaging data as gener- 
ated by a network of cortical nodes, called 'Dy- 



namic Causal Modelling' (DCM) ( Friston et al 
2003| [Kiebel et a\] [2056l |2009a| ). Critically, the 



equations governing the dynamics of each node 
took the form of a rate model as in Eq. |3] The dif- 
ference to the present approach is that DCM uses 
specific, highly constrained connectivity schemes 
based on neural mass models and does not allow 
for errors in the hidden states. Similarly, we used 



the present approach ( Friston et al. 2008 ) to model 



recognition of multi-scale dynamics (Kiebel et al. 



2008 2009b I where the rRNN generalizes these pre- 
vious contributions by using a more generic genera- 
tive model (RNNs) and learning of natural stimuli. 

To our knowledge, the explicit use of (generic) 
recurrent neural networks as generative models for 
recognizing dynamic sensory input using online 
Bayesian inference has not been described before. 
Both techniques, Bayesian inference for dynamic 
stimuli and artificial recurrent neural networks have 



existed in parallel for many years now (Jazwin- 



IsEl, ri lTOl fPearlmutterl [19891 |W illiams a nd Zipser 
1989; Narendra and Parthasarathy, 1990). We pro- 



pose the combination of these two approaches in 
which RNNs act as dynamic models in a nonlinear, 
Bayesian filtering framework. Indeed, this idea has 
already been used implicitly in the field of machine 
learning and control. For example, [Connor et al.| 



(1994) used a related approach in the context of 
autoregressive models to remove outliers from se- 
quences of discrete states which were represented 
by the hidden states of a RNN. Also, in dual ex- 



tended Kalman filter methods for RNNs (Wan and 



Nelson 2001) an extended Kalman filter is used 
to estimate RNN hidden states. However, these 
contributions focus on the usefulness of Bayesian 
filtering of RNN states to make conventional RNN 
learning more robust. Here we describe the idea 
that the combination of RNN equations and fil- 
ter updates can themselves be interpreted as net- 
work equations which are better suited for recog- 
nizing dynamic stimuli. Therefore, the present ap- 
proach also goes beyond previous suggestions of us- 
ing RNNs as functions approximating the update 



equations of a nonlinear filter (Parlos et al. 2001 ), 



or its output (Ting- Ho Lo 1994 1. We, thus, pro- 
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vide a novel perspective on the role of RNNs also 
in possible machine learning applications. 

We motivated the present approach by consider- 
ing the potential functional role of recurrently con- 
nected neuronal ensembles in cortical processing. 
This allowed us to address recognition of arbitrary 
nonlinear dynamics embedded in multidimensional, 
continuous stimuli - something that has not been 
reported with spiking neuron models of neuronal 
coding in recurrent networks ((Rao, 2004 Deneve 



et al. 2007 


Wilson and FinkelJ |2009j |Boerlin and 


Deneve 


2011 


). In contrast, the 'reservoir comput- 



ing' approach (Jaeger 2001 Maass et al. 2002 



Verstraeten et al. 2007) can recognize the class of 
stimuli we considered here. The reservoir comput- 
ing approach has reinvigorated RNN research by 
establishing that very large networks (hundreds to 
thousands of units) combined with a simple read- 
out function can be used to learn and recognize 
both dynamic and static stimuli. However, reser- 
voir computing approaches typically do not adapt 
the dynamics within the network and rely on the 
chance probability that, among the many units, 
there exist some dynamical regimes which are ap- 
propriate generative models of the data. Here, we 
describe an alternative approach and speculate that 
small networks of 'smart' rRNN units may be suf- 
ficient to recognize dynamic stimuli. 

Our use of RNNs as generative models of dy- 
namic stimuli requires learning of their parameters 
(W. V and k in eqs. |3j|4]). In particular, our results 
depend on learning the connections between hid- 
den units in the recurrent network (W). This type 
of learning has been proven to be difficult in the 



past (Hammer and Steil 2002). While the learning 



procedure used here was capable of learning a suffi- 
ciently good dynamic representation of the present 
walks alternative learning approaches may have to 
be used to achieve similar performance on other 
data sets. For example, the different principal com- 
ponent coordinates of our walks had similar time 
scales (Fig. [s]). More complex and longer move- 
ments may demand the use of hierarchical models 



and corresponding learning algorithms ( Hinton and 



Salakhutdinov[ [2006 ). 

While learning is an important issue with RNNs, 
we focused on providing a proof-of-principle that 
the rRNN approach can solve high-level problems 
such as discriminating visual dynamics in an online 
fashion. Importantly, our results appear to be ro- 



bust against sub-optimally learnt generative RNNs. 
This can be seen in Fig. [Sj which shows a residual 
difference between learnt trajectories and the in- 
put. In other words, we found that prediction error 
messages were sufficiently informative even though 
the sensory input observed by the rRNN deviated 
slightly from the internally predicted dynamics. We 
also found robustness of the discrimination against 
white observation noise, see Fig. [8| It is an open 
question whether the rRNN approach is also robust 
against structured variations in human movements, 
e.g. as induced by a variation of a specific move- 
ment. We speculate that such variations require 
a different generative model, either where multi- 
ple movements are embedded in a single gRNN or 
where one uses a hierarchical gRNN similar to the 



approach used in Taylor and Hinton ( 2009 ) 



A Bayesian Inversion of Dy- 
namical Models 

In this appendix we give a high-level description of 
the D-step in Friston's dynamic expectation frame- 
work ( Friston et al. 2008 1 which leads to Eq. [5] in 
the main paper. 

A.l Generalised Coordinates 

A major component of Friston's approach to 
stochastic processes is the redefinition of the time- 
dependent variables in generalised coordinates of 
motion. For example, one replaces x(i) with 



x(i) 



x(t) 



dt 



(14) 



and obtains for the probabilistic form of Eq. [3] 
(dropping the dependence on t for simplicity of 
writing) 



i = x' = /(x) H 

a^x _ „ _df_ , 
dt^ " ax"" 



a^x 



Of 



(15) 



Note that it is assumed that / is locally linear 
around x.(t) and that differently from the usual 
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stochastic process models dependencies between 
noise variables across time are allowed, i.e., it is 
assumed that the noise at two close points in time 
correlates and that the noise process €^{t) is dif- 
ferentiable sufficiently many times. In generalised 
coordinates of motion time-dependent variables en- 
code not only a state at the current time, but addi- 
tionally the future path of states. This is seen when 
we consider how the continuous representation here 
can be mapped onto a discrete sequence of N fu- 
ture observations y — [yi, ■ ■ ■ ,yN]'^ (for simplicity 
we here show only a single observed variable, i.e., 
D = l): 



E 



(16) 



where n is the highest order of motion considered. 
We assume that i — 1, . . . ,N are the times starting 
from t at which the data have been sampled. This 
formula represents a Taylor series approximation 
making use of the derivatives in y. jFnston et ah] 
(2008) have shown that the variance of the noise 



process quickly becomes very large for high order 
motions such that only a small number n of gener- 
alised coordinates and data points need to be taken 
into account at any point in time. One also needs 
to translate discrete data samples into generalised 
coordinates of motion. This can be done using the 
inverse operation of Eq. [16] Rewriting Eq. [16] in 
matrix form gives 



y = Ey 



O'-l)! (17) 
ie{i,...,N}, je{i,...,n}. 

li N = n, E is invertible and one obtains y{t) — 
E^^y. The resulting y(t) is then used to compute 
the likelihood of the data at t and make inference 
over the hidden RNN states Sc{t) as described be- 
low. 

A. 2 Dynamic Approximation of the 
Posterior Mode 

In generalised coordinates eqs. |3] and |4] become 

y = i + ey Di = f + (18) 

where D is a matrix differentiation operator which 
shifts coordinates upwards by one element, f = 



[f^f 



'T e"T 



and g = [g^,g^,g 



are 



the predicted generalised states and observations, 
respectively, with f = and g' = Vx' (anal- 

ogously for higher order terms f", . • • )■ Because g 
is linear here, one can write g = (I (E" V)x where 
(E) denotes the Kronecker product and I e M"^". 
Based on these equations the log-likelihood of the 
observations y{t) is defined as 

c{t) = \ogp{m 

= log J p{y,x\0)dyi j-^g^ 
-log / p(y|x,0)p(x|f,0)dx 



where is a placeholder for all parameters in the 
model. Notice that p(f(x)|Y), where Y represents 
previously seen data, has been approximated as 
5{i{jj,)) - a Dirac delta function at f evaluated at 
the previous posterior mode p, (see below). This 
means that only the mode is propagated through 



the dynamics, but not its uncertainty. Friston et al. 



(2008) then introduce a variational density (?(x) 



(ignoring the density over parameters as learning 
is not our objective) and make use of Jensen's in- 
equality to obtain 



Cit)> 



g X log T^^dx 



nq,t) (20) 



where J^{q, t) is the free energy which is a lower 
bound on the log-likelihood. The aim is to find 
q such that C{t) = F{q,t). In other words, one 
maximises J-{q, t) with respect to q. This is equiv- 
alent to minimising iirL[(7(x)||p(x|y,6')], the KL- 
divergence between variational density and true 
posterior, i.e., after optimisation q is an approxi- 
mation of the posterior density over RNN states. 



In particular, it can be shown (Ghahramani and 



imising F{q, t) is equal to 



1 



Beal[ |2001| [Friston et aTj [2008[ ) that the q max 



9(x) = ^ exp(V"(x)) 



1 
Z 

:p(x|y 



exp(logp(y,i|6l)) 



(21) 



where V{Sc) is called the variational energy. While 
this equation appears to be a trivial statement, the 



formulation of q in this way lets us recognize ( Fris- 
ton et al. 2008 1 that q also is the density defined 
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by a set of stochastically moving particles at their 
stationary solution where the movement of a single 
particle is given by 



dz 



dlogp{y,i\e) 
dz 



m (22) 



and r(i) is a random fluctuation. Using this re- 
lationship one can find q using Monte Carlo simu- 
lation as we can compute the partial derivative of 



logp(y,z|0). However, Friston et al. (20081 simpli- 
fied this further. In particular, a single particle in 
generalised coordinates with motion 



dVji) 



(23) 



will converge to the mode // of V, which is also the 
mode of the posterior, at a rate proportional to the 



constant k (Friston et al. 2008). Given the mode 
jl Friston et al. ( 2008 1 use a Laplace approximation 
for the posterior where q ^ A/'(/i, S) is defined to 
be Gaussian and the covariance S is found as 



= n = - 



d'^ logp(y,x|i 



(24) 



X— /X 



This is the inverse of the negative curvature of the 
posterior evaluated at the mode //. This completes 
the derivation of the approximate posterior over 
RNN states. 

Under the approximations made and given the 
linearity oi g one can identify the posterior p(x|y, 6) 
as being Gaussian exploiting that p{y\Sc,9) and 
p(x|f , 9) are Gaussian. In this case the Laplace 
approximation is exact. Nevertheless, we retained 
Friston's more general form which is also valid for 
nonlinear g. More importantly, this motivates the 
dynamic form of estimating the posterior mode in 
Eq. [23] which allows us to extend the static re- 
sult above to the dynamic case. In particular, note 
that all results above were obtained for only a sin- 



gle time point t. However, it can be shown (Friston 



et al. 20081 that the path integral of the free en- 



ergy is maximised, if Eq. [2T]holds for all t. Naively, 
this means that one has to integrate the motion of 
the particle in Eq. 



23 



to 



Table 2: Online algorithm for finding the approx- 
imate posterior of RNN states. 

initialise /i(0) 
FDR t = 1:7V 

1) compute predictions g(/i) and f(/2) 
from previous fl{t — At) 

2) find y based on n data points 
closest to t using Eq. |17| 

3) compute gradients of l^(x) using 
predictions, D/2(t — Ai) and y 

4) numerically integrate Eq 
get new fl{t) 

END 



Eq. |23| Intuitively, the representation in gener- 
alised coordinates of motion here helps to converge 
to a mode which better represents the data as it 
also takes the local motion (velocity, acceleration, 
etc.) of the mode into account. 

For the purpose of this paper we ignored the ap- 
proximated covariance and only concentrated on 
the posterior mode and the corresponding predic- 
tion errors. A summary of the resulting algorithm 
is shown in Table [2] We were able to ignore the co- 
variance, because we assumed network parameters 
to be fixed during inversion. However, in the full 
DEM-framework these covariances are needed for 
the computation of parameter updates. 

B Learning of RNN Parame- 
ters 

We want to adapt the RNN parameters W, V, k 
such that the observations generated by the RNN 
defined in eqs. |3|4| fit the data. We mainly follow 
the approach underlying dynamic causal modelling 



(Friston et al. 



23 until it converges to fi{t) is detailed in (Friston 



2003 



Kiebel et al. 2009a I which 



2002 Friston et al. 



2002). 



for each t. However, if the particle converges faster 
onto il(t) than p, moves itself, a condition which can 
be ensured by choosing an appropriate rate con- 
stant K, we will be able to track the motion of fl 
with a single particle and the dynamics given by 



This entails an iterative approximation of the pa- 
rameter posterior based on a first-order Taylor ex- 
pansion of an observation function vec(Y) = h{9) 
which represents the underlying dynamical system. 
Here, Y e M^^^ contains the observations at all 
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N time points and 9 = [vec(W)^, vec(V)^, k^]^. 
The RNN states are enclosed in h{6), because the 
dynamics is assumed to be noise free, i.e., de- 
terministic. Both parameter likelihood and prior 
are assumed to be Gaussian so that the following 
gradients of the log-posterior £ = \ogp{6\Y) oc 



on the observations Cy = e^l during learning. We 



logp(Y|6')p(6') are obtained (cf. |Friston[ |2002[ Eq. 
17) 



'do 



3'c: 



■ + Cg-i(/i,-^«) 



(25) 



j^c-ij 



We use these in a numerical integration scheme 
for nonlinear dynamical systems to obtain an up- 
date of the parameters dd based on the model 
d9/dt = dC/d9 and = ^(*) + d9. Here 9^''> is 

the maximum a posteriori estimate of the parame- 
ters in iteration i, [J]jk — d[h{9^^'>)]j /d9k is the Ja- 
cobian of h evaluated at 9^''^ , Cj^ is the covariance of 
the observations and iig and Cg are the prior mean 
and covariance of the parameters, respectively. Fi- 
nally, r = vec(Y) — h{9'-^^) are the residuals of the 
data not explained by the predictions h{9^'^'>) which 
are equivalent to the observation prediction errors 
£y described in the main text. In each iteration 
one obtains the predictions h{9'^^^) by numerical in- 
tegration of the RNN dynamics and the Jacobian 
J using numerical differentiation of h{9'^''^). 

In our experiments we divided learning into two 
phases - an initial phase in which we adapted pa- 
rameters only on local chunks of the data and a 
final phase in which we used the complete data. 
We found that the first phase helped to find a bet- 
ter initialisation of 9 for the optimisation on the 
whole data set in the second phase of learning. In 
the first phase we split the data into seven overlap- 
ping, equal size chunks and ran two passes through 
all chunks where we ran only two iterations of the 
update procedure described above per chunk and 
pass. In the second phase we ran 25 iterations with 
a fast, approximate numerical integration scheme 
for h and subsequently another 4 iterations with 
a slower, but more accurate scheme. While our 
choices for the number of chunks, passes and it- 
erations led to good results, we expect that many 
other values may be chosen equivalently. 

Embedded in each iteration there is also an ex- 
pectation maximization (EM)-like update of hyper- 
parameter A which determines the amount of noise 



refer the reader to (Friston 2002 [Friston et al 



2002 ) for details. A was initialised as 
perprior for A was Gaussian A ^ N{~ 



32. The hy- 

log(s2),l/8) 



where is the average variance of the observation 
variables in the data. 

We initialised the parameters contained in 6'(") 
as follows. The elements of k'"' were chosen uni- 
formly in the interval [1/8,3/8]. Randomly cho- 
sen 2/3 of all elements in W(") were fixed at 0, 
the remaining were drawn randomly from a stan- 
dard normal distribution. Furthermore, following 
( Jaeger et al.[ 2007[ ), we scaled the resulting ma- 



trix by W = 1/(0.955) W(°) to bring the initial 
RNN dynamics to a useful dynamical regime. 5 
here is the largest absolute eigenvalue of the matrix 
rfcW(o) — (1 — fca)I] where a is the leakage (cf. Eq. 
nl) and k — 1/4 is the expected value of ki for any 
i. The initial states a;j(0)' were chosen uniformly 
in [—2,2] for all j. We then found V^°^ as the so- 
lution to the underdetermined system of equations 
y(0) = V('')x(0)' using Matlab's backslash oper- 
ator, i.e., we found the least squares solution for 
V*^*^) with most elements of V'^°) equal to zero. A 
randomly chosen subset of these zero elements were 
also fixed during learning. The number of fixed el- 
ements was 1/3 of the total number of elements. 

In the initial learning phase we set the mean of 
the parameter prior to the described initialisation 
of the parameters /ig = 9^'^\ In the subsequent 
learning phase we set to the result of the 1st 
phase. The covariances of the prior parameter dis- 
tribution were chosen to be diagonal, but also dif- 
fered in the two phases of learning. In the initial 
phase we set the variances associated with the ele- 
ments of W to 1.6 • 10^ while we set the variances 
for V and k to 0.018 and 0.135, respectively. This 
enforced particularly the adaptation of the dynam- 
ical parameters. For learning on the full data set 
we chose these variances to be 7.389, 1 and 1 for 
W, V and k, respectively. 



C Prior Covariances 

For the rRNN the prior covariances, Sj, and S^^, 
modulate the size of updates of the posterior (cf. 
10 and [9]) and influence the result of the 



eqs 



Bayesian inversion. Intuitively, for large prior (co- 
) variances, i.e., a large amount of a priori expected 
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noise, smaller updates are made and larger predic- 
tion errors are tolerated. The amount of noise here 
has to be seen in comparison to the variance of the 
unperturbed states of the units in the gRNN. For 
the sensory states this corresponds to the variance 
of the movement data. The standard deviation of 
the sensory states across all walks averaged over 
the five input dimensions was 0.38 while the stan- 
dard deviation of the corresponding changes in hid- 
den states averaged over the 12 hidden units was 
0.04. In our simulations we assumed isotropic prior 
noise and correspondingly chose covariances of the 
form Sj, = '^yl n where I is the identity matrix 
and ay is our choice of standard deviation. We 
chose CTy = 0.3 and = 0.1 for sensory and hidden 
states, respectively. This means that we tolerated 
only relatively small prediction errors on sensory 
states while allowing for relatively larger prediction 
errors on changes of hidden states. This choice im- 
plements the natural prior belief that the variabil- 
ity of walk observations is mainly determined by 
the variability of the underlying dynamics. 
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